Let’s use CoNLL 2002 data to build a NER system (sklearn-crfsuite)

#sklearn-crfsuite Tutorial

https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html

A rewrite of "Let's use CoNLL 2002 data to build a NER system (python-crfsuite)"

This memo does not go as far as GridSearch (TODO)

Uses the Spanish portion of conll2002

#nltk

Each word is a (token, postag, label) triple
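As a rough illustration of the triple format, the sketch below uses a short hand-written sentence instead of the real corpus, so the tag values in sample_sent are illustrative assumptions. In the tutorial the sentences come from nltk.corpus.conll2002.iob_sents('esp.train') and iob_sents('esp.testb'), which requires the corpus to be downloaded first.

```python
# A hand-written sentence in the same shape as one conll2002 sentence:
# a list of (token, postag, label) triples. The values are illustrative.
sample_sent = [
    ("Melbourne", "NP", "B-LOC"),
    ("(", "Fpa", "O"),
    ("Australia", "NP", "B-LOC"),
    (")", "Fpt", "O"),
    (".", "Fp", "O"),
]


def sent2tokens(sent):
    """Return only the tokens of a sentence."""
    return [token for token, postag, label in sent]


def sent2labels(sent):
    """Return only the NER labels of a sentence (the training targets)."""
    return [label for token, postag, label in sent]


print(sent2tokens(sample_sent))
print(sent2labels(sample_sent))
```

The labels follow the IOB scheme: B- opens an entity, I- continues it, and O marks tokens outside any entity.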

Features

In this example we use word identity, word suffix, word shape and word POS tag; also, some information from nearby words is used.

In concrete terms:

Word identity: the lowercased token

Word suffix: the last 3 and the last 2 characters of the token

Word shape: whether the token is upper case, title case, or a digit

sklearn-crfsuite (and python-crfsuite) supports several feature formats; here we use feature dicts.

Here the features are in dict form
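The feature description above can be sketched as a function that builds one dict per token. This is a minimal version modelled on the tutorial, not a copy of it: the feature key names and the choice of which neighbour features to include are assumptions, and sample_sent is hand-written.

```python
def word2features(sent, i):
    """Build a feature dict for the i-th word of a sentence of triples."""
    word, postag = sent[i][0], sent[i][1]
    features = {
        "bias": 1.0,
        "word.lower()": word.lower(),      # word identity
        "word[-3:]": word[-3:],            # suffix, last 3 characters
        "word[-2:]": word[-2:],            # suffix, last 2 characters
        "word.isupper()": word.isupper(),  # word shape
        "word.istitle()": word.istitle(),
        "word.isdigit()": word.isdigit(),
        "postag": postag,                  # POS tag
    }
    if i > 0:
        # Information from the previous word
        prev_word, prev_postag = sent[i - 1][0], sent[i - 1][1]
        features["-1:word.lower()"] = prev_word.lower()
        features["-1:word.istitle()"] = prev_word.istitle()
        features["-1:postag"] = prev_postag
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(sent) - 1:
        # Information from the next word
        next_word, next_postag = sent[i + 1][0], sent[i + 1][1]
        features["+1:word.lower()"] = next_word.lower()
        features["+1:word.istitle()"] = next_word.istitle()
        features["+1:postag"] = next_postag
    else:
        features["EOS"] = True  # end of sentence
    return features


def sent2features(sent):
    """One feature dict per token, in sentence order."""
    return [word2features(sent, i) for i in range(len(sent))]


# Hand-written sample; the tags are illustrative.
sample_sent = [("Melbourne", "NP", "B-LOC"), ("(", "Fpa", "O"), ("25", "Z", "O")]
features = sent2features(sample_sent)
print(features[0])
```

The training input is then a list of such per-sentence lists, paired with the label sequences.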

Training

To see all possible CRF parameters check its docstring.

https://github.com/TeamHG-Memex/sklearn-crfsuite/blob/0.3.6/sklearn_crfsuite/estimator.py#L14-L209

Here we are using the L-BFGS training algorithm (it is the default) with Elastic Net (L1 + L2) regularization.

L-BFGS is the default algorithm

Elastic Net regularization (L1 + L2)
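A training call along these lines might look like the sketch below. The hyperparameter values are the ones I recall from the upstream tutorial and should be treated as assumptions; c1 and c2 are the L1 and L2 coefficients that together give the Elastic Net penalty. CRF_PARAMS and train_crf are hypothetical names for this memo. The import sits inside the function only so that the parameters can be inspected without the package installed.

```python
# Hyperparameters assumed from the upstream tutorial; tune as needed.
CRF_PARAMS = {
    "algorithm": "lbfgs",              # L-BFGS, the default algorithm
    "c1": 0.1,                         # L1 regularization coefficient
    "c2": 0.1,                         # L2 regularization coefficient
    "max_iterations": 100,
    "all_possible_transitions": True,
}


def train_crf(X_train, y_train, params=CRF_PARAMS):
    """Fit a CRF on feature-dict sequences and their label sequences."""
    import sklearn_crfsuite  # requires the sklearn-crfsuite package

    crf = sklearn_crfsuite.CRF(**params)
    crf.fit(X_train, y_train)
    return crf
```

X_train is a list of sentences, each a list of feature dicts, and y_train is the matching list of label lists.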

Evaluation

There are many more O entities in the data set, but we’re more interested in other entities. To account for this we’ll use the averaged F1 score computed for all labels except O.

In other words, the averaged F1 score is computed over every label except O, because O dominates the data set while the entities of interest are the non-O ones
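Dropping O before scoring could be sketched as follows. In the tutorial the class list comes from crf.classes_ and the filtered list is passed to metrics.flat_f1_score(y_test, y_pred, average='weighted', labels=labels). Here classes is a hand-written stand-in and entity_labels is a hypothetical helper name.

```python
def entity_labels(classes):
    """Drop the O label and group B-/I- tags of the same entity together."""
    labels = [label for label in classes if label != "O"]
    # Sort by entity type first ("-LOC", "-MISC", ...), then by B/I prefix.
    return sorted(labels, key=lambda name: (name[1:], name[0]))


# Stand-in for crf.classes_, in arbitrary order.
classes = ["B-LOC", "O", "B-ORG", "B-PER", "I-PER",
           "B-MISC", "I-ORG", "I-LOC", "I-MISC"]
labels = entity_labels(classes)
print(labels)
```

Sorting this way keeps the B- and I- rows of each entity type next to each other in the report.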

flat_classification_report

As of 2022/06, the latest scikit-learn had made some arguments of classification_report keyword-only, so the call failed with:

TypeError: classification_report() takes 2 positional arguments but 3 positional arguments (and 1 keyword-only argument) were given

https://github.com/TeamHG-Memex/sklearn-crfsuite/blob/0.3.6/sklearn_crfsuite/metrics.py#L68

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html

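The signature is now classification_report(y_true, y_pred, *, labels=None, ...), so labels must be passed as a keyword argument, whereas sklearn-crfsuite 0.3.6 passes it positionally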
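One possible workaround is to flatten the sequences yourself and call scikit-learn directly with labels as a keyword. This is a sketch under that assumption, not necessarily what conll2002.py does. flat_report is a hypothetical helper name and the toy data is hand-written.

```python
from itertools import chain

from sklearn.metrics import classification_report


def flat_report(y_true, y_pred, labels, digits=3):
    """Flatten per-sentence label lists and build a classification report."""
    y_true_flat = list(chain.from_iterable(y_true))
    y_pred_flat = list(chain.from_iterable(y_pred))
    # labels and digits are keyword-only in recent scikit-learn versions.
    return classification_report(
        y_true_flat, y_pred_flat, labels=labels, digits=digits
    )


# Toy sequences, one inner list per sentence.
y_true = [["B-LOC", "O"], ["B-PER", "I-PER", "O"]]
y_pred = [["B-LOC", "O"], ["B-PER", "I-PER", "I-PER"]]
report = flat_report(y_true, y_pred, labels=["B-LOC", "B-PER", "I-PER"])
print(report)
```

Because O is left out of labels, the report contains no O row and the averages cover only the entity labels.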

code:Execution result

$ python -i conll2002.py
Train: 8323 8323
Test: 1517 1517
f1_score=0.7964686316443963
              precision    recall  f1-score   support

       B-LOC      0.810     0.784     0.797      1084
       I-LOC      0.690     0.637     0.662       325
      B-MISC      0.731     0.569     0.640       339
      I-MISC      0.699     0.589     0.639       557
       B-ORG      0.807     0.832     0.820      1400
       I-ORG      0.852     0.786     0.818      1104
       B-PER      0.850     0.884     0.867       735
       I-PER      0.893     0.943     0.917       634

   micro avg      0.813     0.787     0.799      6178
   macro avg      0.791     0.753     0.770      6178
weighted avg      0.809     0.787     0.796      6178